Processor having execution core sections operating at different clock rates

ABSTRACT

A processor including a first execution core section clocked to perform execution operations at a first clock frequency, and a second execution core section clocked to perform execution operations at a second clock frequency which is different than the first clock frequency. The second execution core section runs faster and includes a data cache and critical ALU functions, while the first execution core section includes latency-tolerant functions such as instruction fetch and decode units and non-critical ALU functions. The processor may further include an I/O ring which may be still slower than the first execution core section. Optionally, the first execution core section may include a third execution core section whose clock rate is between that of the first and second execution core sections. Clock multipliers/dividers may be used between the various sections to derive their clocks from a single source, such as the I/O clock.

This application is a reissue divisional of application Ser. No. 10/996,328, filed Nov. 24, 2004; which is a reissue application of application Ser. No. 09/775,383, filed Feb. 2, 2001, now U.S. Pat. No. 6,487,675; which is a continuation of application Ser. No. 09/527,065, filed Mar. 16, 2000, entitled “Processor Having Execution Core Sections Operating at Different Clock Rates”, now U.S. Pat. No. 6,256,745; which was a continuation of Ser. No. 09/092,353, filed Jun. 5, 1998, entitled “Processor Having Execution Core Sections Operating at Different Clock Rates”, now U.S. Pat. No. 6,216,234; which was a continuation of Ser. No. 08/746,606, filed Nov. 13, 1996, entitled “Processor Having Execution Core Sections Operating at Different Clock Rates”, now U.S. Pat. No. 5,828,868.

BACKGROUND OF THE INVENTION

1. Field of the Invention

The present invention relates generally to the field of high speed processors, and more specifically to a processor including a sub-core operating at a higher frequency than the rest of the execution core, and also to a replay architecture for facilitating data-speculating operation of the sub-core.

2. Background of the Prior Art

FIG. 1 illustrates a microprocessor 100 according to the prior art. The microprocessor includes an I/O ring which operates at a first clock frequency, and an execution core which operates at a second clock frequency. For example, the Intel486 DX2 may run its I/O ring at 33 MHz and its execution core at 66 MHz for a 2:1 ratio (1/2 bus), the Intel DX4 may run its I/O ring at 25 MHz and its execution core at 75 MHz for a 3:1 ratio (1/3 bus), and the Intel Pentium® OverDrive® processor may operate its I/O ring at 33 MHz and its execution core at 82.5 MHz for a 2.5:1 ratio (2/5 bus).

A distinction may be made between “I/O operations” and “execution operations”. For example, in the DX2, the I/O ring performs I/O operations such as buffering, bus driving, receiving, parity checking, and other operations associated with communicating with the off-chip world, while the execution core performs execution operations such as addition, multiplication, address generation, comparisons, rotation and shifting, and other “processing” manipulations.

The processor 100 may optionally include a clock multiplier. In this mode, the processor can automatically set the speed of its execution core according to an external, slower clock provided to its I/O ring. This may reduce the number of pins needed. Alternatively, the processor may include a clock divider, in which case the processor sets the I/O ring speed responsive to an external clock provided to the execution core.

These clock multiply and clock divide functions are logically the same for the purposes of this invention, so the term “clock mult/div” will be used herein to denote either a multiplier or divider as suitable. The skilled reader will comprehend how external clocks may be selected and provided, and from there multiplied or divided. Therefore, specific clock distribution networks, and the details of clock multiplication and division, will not be expressly illustrated. Furthermore, the clock mult/div units need not necessarily be limited to integer multiple clocks, but can perform e.g. 2:5 clocking. Finally, the clock mult/div units need not necessarily even be limited to fractional bus clocking, but can, in some embodiments, be flexible, asynchronous, and/or programmable, such as in providing a P/Q clocking scheme.

The basic motivation for increasing clock frequencies in this manner is to reduce instruction latency. The execution latency of an instruction may be defined as the time from when its input operands must be ready for it to execute until its result is ready to be used by another instruction. Suppose that a part of a program contains a sequence of N instructions, I₁, I₂, I₃, . . . , I_(N). Suppose that I_(n+1) requires, as part of its inputs, the result of I_(n), for all n, from 1 to N−1. This part of the program may also contain any other instructions. Then we can see that this program cannot be executed in less time than T=L₁+L₂+L₃+ . . . +L_(N), where L_(n) is the latency of instruction I_(n), for all n from 1 to N. In fact, even if the processor was capable of executing a very large number of instructions in parallel, T remains a lower bound for the time to execute this part of this program. Hence to execute this program faster, it will ultimately be essential to shorten the latencies of the instructions.
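By way of illustration only, the lower bound T may be sketched in Python as follows. The function name and the latency values are hypothetical and are chosen solely to make the arithmetic concrete.

```python
def chain_lower_bound(latencies):
    """Return T = L_1 + L_2 + ... + L_N for a chain of dependent instructions.

    Each instruction I_(n+1) needs the result of I_n, so parallel hardware
    cannot finish the chain in less than the summed latency.
    """
    return sum(latencies)

# Hypothetical latencies, in nanoseconds, for a four-instruction chain.
latencies_ns = [2, 1, 3, 1]
bound_ns = chain_lower_bound(latencies_ns)

# Halving every latency halves the bound; adding execution units does not.
faster_bound_ns = chain_lower_bound([l / 2 for l in latencies_ns])
```

In this sketch the bound falls from 7 ns to 3.5 ns only because each individual latency was shortened.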

We may look at the same thing from a slightly different point of view. Define that an instruction I_(n) is “in flight” from the time that it requires its input operands to be ready until the time when its result is ready to be used by another instruction. Instruction I_(n) is therefore “in flight” for a length of time L_(n)=A_(n)*C where A_(n) is the latency, as defined above, of I_(n), but this time expressed in cycles. C is the cycle time. Let a program execute N instructions as above and take M “cycles” or units of time to do it. Looked at from either point of view, it is critically important to reduce the execution latency as much as possible.

The average latency can be conventionally defined as (1/N)*(L₁+L₂+L₃+ . . . +L_(N))=(C/N)*(A₁+A₂+A₃+ . . . +A_(N)). Let f_(j) be the number of instructions that are in flight during cycle j. We can then define the parallelism P as the average number of instructions in flight for the program or (1/M)*(f₁+f₂+f₃+ . . . +f_(M)).

Notice that f₁+f₂+f₃+ . . . +f_(M)=A₁+A₂+A₃+ . . . +A_(N). Both sides of this equation are ways of counting up the number of cycles in which instructions are in flight, wherein if x instructions are in flight in a given cycle, that cycle counts as x cycles.

Now define the “average bandwidth” B as the total number of instructions executed, N, divided by the time used, M*C, or in other words, B=N/(M*C).

We may then easily see that P=L*B: from the definitions above, P=(1/M)*(A₁+A₂+A₃+ . . . +A_(N)), while L*B=(C/N)*(A₁+A₂+A₃+ . . . +A_(N))*N/(M*C), which reduces to the same expression. In this formula, L is the average latency for a program, B is its average bandwidth, and P is its average parallelism. Note that B tells how fast we execute the program. It is instructions per second. If the program has N instructions, it takes N/B seconds to execute it. The goal of a faster processor is exactly the goal of getting B higher.
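The relation P=L*B may be checked numerically. The following Python sketch is illustrative only: the function name, the in-flight intervals, and the cycle time are assumptions, and execution is assumed to begin at cycle 0.

```python
from fractions import Fraction

def flight_stats(intervals, cycle_time):
    """Compute (L, P, B) from half-open in-flight spans (start_cycle, end_cycle)."""
    n = len(intervals)                               # N instructions
    m = max(end for _, end in intervals)             # M cycles used, counted from cycle 0
    a = [end - start for start, end in intervals]    # A_n, latency in cycles
    f = [sum(1 for s, e in intervals if s <= j < e)  # f_j, instructions in flight
         for j in range(m)]
    assert sum(f) == sum(a)                          # the counting identity in the text
    avg_latency = cycle_time * sum(a) / n            # L = (C/N) * sum(A_n)
    parallelism = Fraction(sum(f), m)                # P = (1/M) * sum(f_j)
    bandwidth = Fraction(n) / (m * cycle_time)       # B = N / (M*C)
    return avg_latency, parallelism, bandwidth

# Hypothetical schedule: four instructions, cycle time of 1/2 ns.
L, P, B = flight_stats([(0, 2), (1, 4), (2, 3), (3, 6)], Fraction(1, 2))
```

Exact fractions are used so that the comparison of P with L*B is not disturbed by floating-point rounding.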

We now note that increasing B requires either increasing the parallelism P, or decreasing the average latency L. It is well known that the parallelism, P, that can be readily exploited for a program is limited. While it is true that certain classes of programs have large exploitable parallelism, a large class of important programs has P restricted to quite small numbers.

One drawback which the prior art processors have is that their entire execution core is constrained to run at the same clock speed. This limits some components within the core in a “weakest link” or “slowest path” manner.

In the 1960s and 1970s, there existed central processing units in which a multiplier or divider co-processor was clocked at a frequency higher than other circuitry in the central processing unit. These central processing units were constructed of discrete components rather than as integrated circuits or monolithic microprocessors. Due to their construction as co-processors, and/or the fact that they were not integrated with the main processor, these units should not be considered as “sub-cores”.

Another feature of some prior art processors is the ability to perform “speculative execution”. This is also known as “control speculation”, because the processor guesses which way control (branching) instructions will go. Some processors perform speculative fetch, and others, such as the Intel Pentium Pro processor, also perform speculative execution. Control speculating processors include mechanisms for recovering from mispredicted branches, to maintain program and data integrity as though no speculation were taking place.

FIG. 2 illustrates a conventional data hierarchy. A mass storage device, such as a hard drive, stores the programs and data (collectively “data”) which the computer system (not shown) has at its disposal. A subset of that data is loaded into memory such as DRAM for faster access. A subset of the DRAM contents may be held in a cache memory. The cache memory may itself be hierarchical, and may include a level two (L2) cache, and then a level one (L1) cache which holds a subset of the data from the L2. Finally, the physical registers of the processor contain the smallest subset of the data. As is well known, various algorithms may be used to determine what data is stored in what levels of this overall hierarchy. In general, it may be said that the more recently a datum has been used, or the more likely it is to be needed soon, the closer it will be held to the processor.

The presence or absence of valid data at various points in the hierarchical storage structure has implications on another drawback of the prior art processors, including control speculating processors. The various components within their execution cores are designed such that they cannot perform “data speculation”, in which a processor guesses what values data will have (or, more precisely, the processor assumes that presently-available data values are correct and identical to the values that will ultimately result, and uses those values as inputs for one or more operations), rather than which way branches will go. Data speculation may involve speculating that data presently available from a cache are identical to the true values that those data should have, or that data presently available at the output of some execution unit are identical to the true values that will result when the execution unit completes its operation, or the like.

Like control speculating processors' recovery mechanisms, data speculating processors must have some mechanism for recovering from having incorrectly assumed that data values are correct, to maintain program and data integrity as though no data speculation were taking place. Data speculation is made more difficult by the hierarchical storage system, especially when it is coupled with a microarchitecture which uses different clock frequencies for various portions of the execution environment.

It is well-known that every processor is adapted to execute instructions of its particular “architecture”. In other words, every processor executes a particular instruction set, which is encoded in a particular machine language. Some processors, such as the Pentium Pro processor, decode those “macro-instructions” down into “micro-instructions” or “uops”, which may be thought of as the machine language of the micro-architecture and which are directly executed by the processor's execution units. It is also well-known that other processors, such as those of the RISC variety, may directly execute their macro-instructions without breaking them down into micro-instructions. For purposes of the present invention, the term “instruction” should be considered to cover any or all of these cases.

SUMMARY OF THE INVENTION

The invention provides a microprocessor having two or more levels of execution sub-core each clocked at different frequencies. The processor may also have an I/O ring, which may be clocked at yet another frequency. Clock division or multiplication may be used between the various levels, to derive the various clocks from a common clock, such as the I/O clock, which may be provided from off-chip. Having the different clock domains enables the designer to make trade-offs in the design of various components of the chip, such as individual execution units, instruction fetch and decode units, register files, caches, and the like. Thus, selected components can be designed to operate at a very high frequency, without requiring the entire chip to be designed to operate at this frequency. Less latency-critical units, or those whose required throughput can be obtained by twice as many units running at half the clock speed, can be relegated to the slower sections of the chip, easing their design considerably.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a block diagram illustrating a prior art processor having an I/O ring and an execution core operating at different clock speeds.

FIG. 2 demonstrates a hierarchical memory structure such as is well known in the art.

FIG. 3 is a block diagram illustrating the processor of the present invention, and showing a plurality of execution core sections each having its own clock frequency.

FIG. 4 is a block diagram illustrating a mode in which the processor of FIG. 3 includes yet another sub-core with its own clock frequency.

FIG. 5 is a block diagram illustrating a different mode in which the sub-core is not nested as shown in FIG. 4.

FIG. 6 is a block diagram illustrating a partitioning of the execution core.

FIG. 7 is a block diagram illustrating one embodiment of the replay architecture of the present invention, which permits data speculation.

FIG. 8 illustrates one embodiment of the checker unit of the replay architecture.

DETAILED DESCRIPTION OF THE INVENTION

FIG. 3 illustrates the high-speed sub-core 205 of the processor 200 of the present invention. The high-speed sub-core includes the most latency-intolerant portions of the particular architecture and/or microarchitecture employed by the processor. For example, in an Intel Architecture processor, certain arithmetic and logic functions, as well as data cache access, may be the most unforgiving of execution latency.

Other functions, which are not so sensitive to execution latency, may be contained within a more latency-tolerant execution core 210. For example, in an Intel Architecture processor, execution of infrequently-executed instructions, such as transcendentals, may be relegated to the slower part of the core.

The processor 200 communicates with the rest of the system (not shown) via the I/O ring 215. If the I/O ring operates at a different clock frequency than the latency-tolerant execution core, the processor may include a clock mult/div unit 220 which provides clock division or multiplication in any suitable manner by conventional means. Because the latency-intolerant execution sub-core 205 operates at a higher frequency than the rest of the latency-tolerant execution core 210, there may be a mechanism 225 for providing a different clock frequency to the latency-intolerant execution sub-core 205. In one mode, this is a clock mult/div unit 225.

FIG. 4 illustrates a refinement of the invention shown in FIG. 3. The processor 250 of FIG. 4 includes the I/O ring 215, clock mult/div unit 220, and latency-tolerant execution core 210. However, in place of the unitary sub-core (205) and clock mult/div unit (225) of FIG. 3, this improved processor 250 includes a latency-intolerant execution sub-core 255 and an even more latency-critical execution sub-core 260, with their clock mult/div units 265 and 270, respectively.
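The nested derivation of clocks in FIG. 4 may be sketched as follows. This Python sketch is illustrative only; the function name, the 66 MHz I/O clock, and every ratio shown are assumptions, not values taken from the figures.

```python
from fractions import Fraction

def derive_clocks(io_mhz, ratios):
    """Chain clock mult/div ratios inward, starting from the I/O clock.

    ratios: list of (section_name, ratio); each ratio multiplies the clock
    of the enclosing section. Fractions permit non-integer ratios.
    """
    clocks = {"io_ring": Fraction(io_mhz)}
    current = Fraction(io_mhz)
    for name, ratio in ratios:
        current = current * Fraction(ratio)
        clocks[name] = current
    return clocks

clocks = derive_clocks(66, [
    ("latency_tolerant_core", 2),                   # hypothetical ratio for unit 220
    ("latency_intolerant_subcore", 2),              # hypothetical ratio for unit 265
    ("latency_critical_subcore", Fraction(3, 2)),   # hypothetical ratio for unit 270
])
```

With these assumed ratios, a single 66 MHz source yields 132 MHz, 264 MHz, and 396 MHz for the successively inner sections.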

The skilled reader will appreciate that this is illustrative of a hierarchy of sub-cores, each of which includes those units which must operate at least as fast as the respective sub-core level. The skilled reader will further appreciate that the selection of what units go how deep into the hierarchy will be made according to various design constraints such as die area, clock skew sensitivity, design time remaining before tapeout date, and the like. In one mode, an Intel Architecture processor may advantageously include only its most common integer ALU functions and data storage portion of its data cache in the innermost sub-core. In one mode, the innermost sub-core may also include the register file; although, for reasons including those stated above concerning FIG. 2, the register file might not technically be needed to operate at the highest clock frequency, its design may be simplified by including it in a more inner sub-core than is strictly necessary. For example, it may be more efficient to make twice as fast a register file with half as many ports, than vice versa.

In operation, the processor performs an I/O operation at the I/O ring and at the I/O clock frequency, such as to bring in a data item not presently available within the processor. Then, the latency-tolerant execution core may perform an execution operation on the data item to produce a first result. Then, the latency-intolerant execution sub-core may perform an execution operation on the first result to produce a second result. Then, the latency-critical execution sub-core may perform a third execution operation upon the second result to produce a third result. Those skilled in the art will understand that the flow of execution need not necessarily proceed in the strict order of the hierarchy of execution sub-cores. For example, the newly read in data item could go immediately to the innermost core, and the result could go from there to any of the core sections or even back to the I/O ring for writeback.

FIG. 5 shows an embodiment which is slightly different than that of FIG. 4. The processor 280 includes the I/O ring 215, the execution cores 210, 255, 260, and the clock mult/div units 220, 265, 270. However, in this embodiment the latency-critical execution sub-core 260 is not nested within the latency-intolerant execution core 255. In this mode, the clock mult/div units 265 and 270 perform different ratios of multiplication to enable their respective cores to run at different speeds.

In another slightly different mode (not shown), either of these cores might be clock-interfaced directly to the I/O ring or to the external world. In such a mode, clock mult/div units may not be required, if separate clock signals are provided from outside the processor.

It should be noted that the different speeds at which the various layers of sub-core operate may be in-use, operational speeds. It is known, for example in the Pentium processor, that certain units may be powered down when not in use, by reducing or halting their clock; in this case, the processor may have the bulk of its core running at 66 MHz while a sub-core such as the FPU is at substantially 0 MHz. While the present invention may be used in combination with such power-down or clock throttling techniques, it is not limited to such cases.

Those skilled in the art will appreciate that non-integer ratios may be applied at any of the boundaries, that the combinations of clock ratios between the various rings are almost limitless, and that different baseline frequencies could be used at the I/O ring. It is also possible that the clock multiplication factors might not remain constant over time. For example, in some modes, the clock multiplication applied to the innermost sub-core could be adjusted up and down, for example between 3× and 1× or between 2× and 0× or the like, when the higher frequency (and therefore higher power consumption and heat generation) is not needed. Also, the processor may be subjected to clock throttling or clock stop, in whole or in part. Or, the I/O clock might not be a constant frequency, in which case the other clocks may either scale accordingly, or they may implement some form of adaptive P/Q clocking scheme to maintain their desired performance level.

FIG. 6 illustrates somewhat more detail about one embodiment of the contents of the latency-critical execution sub-core 260 of FIG. 4. (It may also be understood to illustrate the contents of the sub-core 205 of FIG. 3 or the sub-core 255 of FIG. 4.) The latency-tolerant execution core 210 includes components which are not latency-sensitive, but which are dependent only upon some level of throughput. In this sense, the latency-tolerant components may be thought of as the “plumbing” whose job is simply to provide a particular “gallons per minute” throughput, in which a “big pipe” is as good as a “fast flow”.

For example, in some architectures the fetch and decode units may not be terribly demanding on execution latency, and may thus be put in the latency-tolerant core 210 rather than the latency-intolerant sub-core 205, 255, 260. Likewise, the microcode and register file may not need to be in the sub-core. In some architectures (or microarchitectures), the most latency-sensitive pieces are the arithmetic/logic functions and the cache. In the mode shown in FIG. 6, only a subset of the arithmetic/logic functions are deemed to be sufficiently latency-sensitive that it is warranted to put them into the sub-core, as illustrated by critical ALU 300.

In some embodiments, the critical ALU functions include adders, subtractors, and logic units for performing AND, OR, and the like. In some embodiments which use index register addressing, such as the Intel Architecture, the critical ALU functions may also include a small, special-purpose shifter for doing address generation by scaling the index register. In some embodiments, the register file may reside in the latency-critical execution core, for design convenience; the faster the core section the register file is in, the fewer ports the register file needs.

The functions which are generally more latency-sensitive than the plumbing are those portions which are of a recursive nature, or those which include a dependency chain. Execution is a prime example of this concept; execution tends to be recursive or looping, and includes both false and true data dependencies both between and within iterations and loops.

Current art in high performance computer design (e.g. the Pentium Pro processor) already exploits most of the readily exploitable parallelism in a large class of important low P programs. It becomes extraordinarily difficult or even practically impossible to greatly increase P for these programs. In this case there is no alternative to reducing the average latency if it is desired to build a processor to run these programs faster.

On the other hand, there are certain other functions such as, for example, instruction decode or register renaming. While it is essential that these functions are performed, current art has it arranged that the lapsed time for performing these functions may have an effect on performance only when a branch has been mispredicted. A branch is mispredicted typically once in fifty instructions on average. Hence one nanosecond longer to do decoding or register renaming provides the equivalent of a 1/50 nanosecond increase in average instruction execution latency, while a one nanosecond increase in the time to execute an instruction increases the average instruction latency by one nanosecond. We may conclude that the time it takes to decode instructions or rename registers, for example, is significantly less critical than the time it takes to execute instructions.
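The leverage argument above may be restated as a short calculation. The following Python sketch is illustrative only; the function name is hypothetical, and the once-in-fifty misprediction rate is the typical figure given in the text.

```python
from fractions import Fraction

def avg_latency_increase(extra_ns, exposures_per_instruction):
    """Average per-instruction latency added by a delay that is only
    exposed on a fraction of instructions."""
    return extra_ns * exposures_per_instruction

# Decode/rename delay is exposed only on a mispredicted branch,
# assumed here to occur once per fifty instructions.
decode_cost_ns = avg_latency_increase(Fraction(1), Fraction(1, 50))

# Execution delay is exposed on every instruction.
execute_cost_ns = avg_latency_increase(Fraction(1), Fraction(1))
```

Under this assumption, one nanosecond added to execution costs fifty times as much average latency as one nanosecond added to decode or renaming.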

There are still other functions that must be performed in a processor. Many of these functions are even more highly leveraged than decoding and register renaming. For these functions, a 1 nanosecond increase in the time to perform them may add even less than 1/50 nanosecond to the average execution latency. We may conclude that the time it takes to do these functions is even less critical.

As shown, the other ALU functions 305 can be relegated to the less speedy core 210. Further, in the mode shown in FIG. 6, only a subset of the cache needs to be inside the sub-core. As illustrated, only the data storage portion 310 of the cache is inside the sub-core, while the hit/miss logic and tags are in the slower core 210. This is in contrast to the conventional wisdom, which is that the hit/miss signal is needed at the same time as the data. A recent paper implied that the hit/miss signal is the limiting factor on cache speed (Todd M. Austin, Dionisios N. Pnevmatikatos, and Gurindar S. Sohi, “Streamlining Data Cache Access with Fast Address Calculation”, Proceedings of the 22nd Annual International Symposium on Computer Architecture, Jun. 18-24, 1995, Session 8, No. 1, page 5). Unfortunately, hit/miss determination is more difficult and more time-consuming than the simple matter of reading data contents from cache locations.

Further, the instruction cache (not shown) may be entirely in the core 210, such that the cache 310 stores only data. The instruction cache (Icache) is accessed speculatively. It is the business of branch prediction to predict where the flow of the program will go, and the Icache is accessed on the basis of that prediction. Branch prediction methods commonly used today can predict program flow without ever seeing the instructions in the Icache. If such a method is used, then the Icache is not latency-sensitive, and becomes more bandwidth-constrained than latency-constrained, and can be relegated to a lower clock frequency portion of the execution core.

The branch prediction itself could be latency-sensitive, so it would be a good candidate for a fast cycle time in one of the inner sub-core sections.

At first glance, one might think that the innermost sub-core 205, 255, or 260 of FIG. 6 would therefore hold the data which is stored at the top of the memory hierarchy of FIG. 2, that is, the data which is stored in the registers. However, as is illustrated in FIG. 6, the register file need not be contained within the sub-core, but may, instead, be held in the less speedy portion of the core 210. In the mode of FIGS. 3 or 4, the register file may be stored in any of the core sections 205, 210, 255, 260, as suits the particular embodiment chosen. As shown in FIG. 6, the reason that the register file is not required to be within the innermost core is that the data which result from operations performed in the critical ALU 300 are available on a bypass bus 315 as soon as they are calculated. By appropriate operation of multiplexors (in any conventional manner), these data can be made available to the critical ALU 300 in the next clock cycle of the sub-core, far sooner than they could be written to and then read from the register file.
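The role of the bypass bus may be sketched in software as follows. This Python model is a simplification offered for illustration only; the class, method, and register names are hypothetical, and a real bypass network is a hardware multiplexor rather than a dictionary lookup.

```python
class CriticalALU:
    """Toy model of a critical ALU whose operand mux prefers the bypass bus."""

    def __init__(self, register_file):
        self.register_file = register_file  # slower storage: register -> value
        self.bypass = {}                    # most recent result: dest -> value

    def read_operand(self, reg):
        # Multiplexor: take the value from the bypass bus when it carries
        # this register; otherwise fall back to the register file.
        if reg in self.bypass:
            return self.bypass[reg]
        return self.register_file[reg]

    def add(self, dest, src_a, src_b):
        result = self.read_operand(src_a) + self.read_operand(src_b)
        self.bypass = {dest: result}        # visible to the next operation
        return result

# The register file is never written in this sketch, which shows that a
# dependent operation does not have to wait for writeback.
rf = {"r1": 1, "r2": 2, "r3": 0}
alu = CriticalALU(rf)
first = alu.add("r3", "r1", "r2")
second = alu.add("r4", "r3", "r3")
```

The second addition consumes the first result directly from the bypass value while the register file still holds the stale value of r3.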

Similarly, if data speculation is permitted, that is, if the critical ALU is allowed to perform calculations upon operands which are not yet known to be valid, portions of the data cache need not reside within the innermost sub-core. In this mode, the data cache 310 holds only the actual data, while the hit/miss logic and cache tags reside in a slower portion 210 of the core. In this mode, data from the data cache 310 are provided over an inner bus 320 and muxed into the critical ALU, and the critical ALU performs operations assuming those data to be valid.

Some number of clock cycles later, the hit/miss logic or the tag logic in the outer core may signal that the speculated data is, in fact, invalid. In this case, there must be a means provided to recover from the speculative operations which have been performed. This includes not only the specific operations which used the incorrect, speculated data as input operands, but also any subsequent operations which used the outputs of those specific operations as inputs. Also, the erroneously generated outputs may have subsequently been used to determine branching operations, such as if the erroneously generated output is used as a branch address or as a branch condition. If the processor performs control speculation, there may also have been errors in that operation.
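The set of operations affected by one invalid speculated datum may be sketched as a transitive walk over data dependencies. The following Python sketch is illustrative only; the function name, the tuple layout, and the operation names are hypothetical.

```python
def ops_to_replay(ops, bad_op):
    """Return, in program order, the operation that consumed invalid data
    and every later operation that depends on it, directly or indirectly.

    ops: list of (name, dest, sources) tuples in program order.
    """
    tainted_values = set()
    replay = []
    for name, dest, sources in ops:
        if name == bad_op or tainted_values.intersection(sources):
            replay.append(name)
            tainted_values.add(dest)      # its output is now suspect too
        else:
            tainted_values.discard(dest)  # overwritten by a clean result
    return replay

# Hypothetical sequence: the load speculated on invalid cache data.
program = [
    ("load",   "t1", ["a"]),
    ("add",    "t2", ["t1", "b"]),
    ("mul",    "t3", ["c", "d"]),
    ("branch", "t4", ["t2"]),   # branch condition derived from bad data
]
affected = ops_to_replay(program, "load")
```

In this example the multiply is independent of the load and is left alone, while the add and the branch that consumed tainted values are marked.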

The present invention provides a replay mechanism for recovering from data speculation upon data which ultimately prove to have been incorrect. In one mode, the replay mechanism may reside outside the innermost core, because it is not terribly latency-critical. While the replay architecture is described in conjunction with a multiple-clock-speed execution engine which performs data speculation, it will be appreciated that the replay architecture may be used with a wide variety of architectures and micro-architectures, including those which perform data speculation and those which do not, those which perform control speculation and those which do not, those which perform in-order execution and those which perform out-of-order execution, and so forth.

FIG. 7 illustrates one implementation of such a replay architecture, generally showing the data flow of the architecture. First, an instruction is fetched into the instruction cache.

From the instruction cache, the instruction proceeds to a renamer such as a register alias table. In sophisticated microarchitectures which permit data speculation and/or control speculation, it is highly desirable to decouple the actual machine from the specific registers indicated by the instruction. This is especially true in an architecture which is register-poor, such as the Intel Architecture. Renamers are well known, and the details of the renamer are not particularly germane to an understanding of the present invention. Any conventional renamer will suffice. It is desirable that it be a single-valued and single-assignment renamer, such that each instance of a given instruction will write to a different register, although the instruction specifies the same register. The renamer provides a separate storage location for each different value that each logical register assumes, so that no such value of any logical register is prematurely lost (i.e. before the program is through with that value), over a well-defined period of time.
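A single-assignment renamer of the kind described may be sketched as follows. This Python model is illustrative only and does not represent any particular register alias table; the class name, method name, and physical register naming are hypothetical, and reclamation of physical registers is omitted.

```python
class Renamer:
    """Toy single-assignment renamer: every write gets a fresh location."""

    def __init__(self):
        self.alias = {}      # logical register -> current physical register
        self.next_phys = 0

    def rename(self, dest, sources):
        # Sources read the most recent mapping. Registers not yet written
        # keep their logical name in this toy model.
        phys_sources = [self.alias.get(s, s) for s in sources]
        # Each write is assigned a new physical register, so earlier values
        # of the same logical register are not overwritten.
        phys_dest = f"p{self.next_phys}"
        self.next_phys += 1
        self.alias[dest] = phys_dest
        return phys_dest, phys_sources

renamer = Renamer()
first = renamer.rename("eax", ["ebx"])   # eax <- f(ebx)
second = renamer.rename("eax", ["eax"])  # eax <- g(eax)
```

Both instructions name the same logical destination, yet the second reads the first's physical register and writes a different one, so the earlier value remains available should a replay need it.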

From the renamer, the instruction proceeds to an optional scheduler such as a reservation station, where instructions are reordered to improve execution efficiency. The scheduler is able to detect when it is not allowed to issue further instructions. For example, there may not be any available execution slots into which a next instruction could be issued. Or, another unit may for some reason temporarily disable the scheduler. In some embodiments, the scheduler may reside in the latency-critical execution core, if the particular scheduling algorithm can schedule only single latency generation per cycle, and is therefore tied to the latency of the critical ALU functions.

From the renamer or the optional scheduler, the instruction proceeds to the execution core 205, 210, 255, 260 (indirectly through a multiplexor to be described below), where it is executed. After or simultaneous with its execution, an address associated with the instruction is sent to the translation lookaside buffer (TLB) and cache tag lookup logic (TAG). This address may be, for example, the address (physical or logical) of a data operand which the instruction requires. From the TLB and TAG logic, the physical address referenced and the physical address represented in the cache location accessed are passed to the hit/miss logic, which determines whether the cache location accessed in fact contained the desired data.

In one mode, if the instruction being executed reads memory, the execution logic gives the highest priority to generating perhaps only a portion of the address, but enough that data may be looked up in the high speed data cache. In this mode, this partial address is used with the highest priority to retrieve data from the data cache, and only as a secondary priority is a complete virtual address, or in the case of the Intel Architecture, a complete linear address, generated and sent to the TLB and cache TAG lookup logic.
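The separation of the fast data lookup from the slower hit/miss determination may be sketched as follows. This Python model is illustrative only: a direct-mapped cache is assumed, the bit widths and function names are hypothetical, and address translation by the TLB is omitted, so the address is treated as already physical.

```python
LINE_BITS = 4    # hypothetical: 16-byte cache lines
INDEX_BITS = 6   # hypothetical: 64 cache locations

def split_address(addr):
    """Split an address into the partial (index) bits and the tag bits."""
    index = (addr >> LINE_BITS) & ((1 << INDEX_BITS) - 1)
    tag = addr >> (LINE_BITS + INDEX_BITS)
    return index, tag

def fill(data_array, tag_array, addr, value):
    index, tag = split_address(addr)
    data_array[index] = value
    tag_array[index] = tag

def speculative_read(data_array, addr):
    # Fast path: only the partial address is needed; no tag check is made.
    index, _ = split_address(addr)
    return data_array[index]

def hit_miss(tag_array, addr):
    # Slow path: the complete address is compared against the stored tag.
    index, tag = split_address(addr)
    return tag_array[index] == tag

data_array = [0] * (1 << INDEX_BITS)
tag_array = [None] * (1 << INDEX_BITS)
fill(data_array, tag_array, 0x1234, 99)

# 0x5234 shares the index bits of 0x1234 but carries a different tag.
speculated = speculative_read(data_array, 0x5234)
was_hit = hit_miss(tag_array, 0x5234)
```

Here the fast path returns a value for 0x5234 immediately, and only the later tag comparison reveals that the value belonged to a different address.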

Because the critical ALU functions and the data cache are in the innermost sub-core, or are at least in a portion of the processor which runs at a higher clock rate than the TLB and TAG logic and the hit/miss logic, some data will have already been obtained from the data cache by the time the hit/miss result is known, and the processor will have already speculatively executed the instruction which needed that data, having assumed the obtained data to be correct. The processor will likely also have executed additional instructions using that data or the results of the first speculatively executed instruction.

Therefore, the replay architecture includes a checker unit which receives the output of the hit/miss logic. If a miss is indicated, the checker causes a “replay” of the offending instruction and any which depended on it or which were otherwise incorrect as a result of the erroneous data speculation. When the instruction was handed from the reservation station to the execution core, a copy of it was forwarded to a delay unit which provides a delay latency which matches the time the instruction will take to get through the execution core, TLB/TAG, and hit/miss units, so that the copy arrives at the checker at about the same time that the hit/miss logic tells the checker that the data speculation was incorrect. In one mode, this is roughly 10-12 clocks of the inner core. In FIG. 7, the delay unit is shown as being outside the checker. In other embodiments, the delay unit may be incorporated as a part of the checker. In some embodiments, the checker may reside within the latency-critical execution core, if the checking algorithm is tied to the critical ALU speed.
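The delay unit described above behaves like a fixed-depth shift register: a copy of an instruction enters on one clock and emerges a fixed number of clocks later, in step with the hit/miss result. The following is a minimal sketch of that behavior only. The class name `DelayUnit`, its `tick` method, and the depth used in the usage note are hypothetical and are not taken from the figures.

```python
from collections import deque
from typing import Optional


class DelayUnit:
    """Fixed-latency pipeline: what goes in comes out `depth` ticks later."""

    def __init__(self, depth: int) -> None:
        if depth < 1:
            raise ValueError("depth must be at least 1")
        # Pre-fill with empty slots so the first outputs are bubbles.
        self._pipe = deque([None] * depth, maxlen=depth)

    def tick(self, incoming: Optional[str] = None) -> Optional[str]:
        """Advance one clock: return the oldest slot, push the new one."""
        outgoing = self._pipe[0]
        # With maxlen set, append() drops the leftmost (oldest) slot.
        self._pipe.append(incoming)
        return outgoing
```

For instance, with a hypothetical depth of 3, an instruction copy passed to `tick` on one clock is returned by the third `tick` call after it; a real design would size the depth to the execution core, TLB/TAG, and hit/miss latency.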

When the checker determines that data speculation was incorrect, the checker sends the copy of the instruction back around for a “replay”. The checker forwards the copy of the instruction to a buffer unit. It may happen as an unrelated event that the TLB/TAG unit informs the buffer that the TLB/TAG is inserting a manufactured instruction in the current cycle. This information is needed by the buffer so the buffer knows not to reinsert another instruction in the same cycle. Both the TLB/TAG and the buffer also inform the scheduler when they are inserting instructions, so the scheduler knows not to dispatch an instruction in that same cycle. These control signals are not shown but will be understood by those skilled in the art.

The buffer unit provides latching of the copied instruction, to prevent it from getting lost if it cannot immediately be handled. In some embodiments, there may be conditions under which it may not be possible to reinsert replayed instructions immediately. In these conditions, the buffer holds them, perhaps a large number of them, until they can be reinserted. One such condition may be that there may be some higher priority function that could claim execution, such as when the TLB/TAG unit needs to insert a manufactured instruction, as mentioned above. In some other embodiments, the buffer may not be necessary.

Earlier, it was mentioned that the scheduler's output was provided to the execution core indirectly, through a multiplexor. The function of this multiplexor is to select among several possible sources of instructions being sent for execution. The first source is, of course, the scheduler, in the case when it is an original instruction which is being sent for execution. The second source is the buffer unit, in the case when it is a copy of an instruction which is being sent for replay execution. A third source is illustrated as being from the TLB/TAG unit; this permits the architecture to manufacture “fake instructions” and inject them into the instruction stream. For example, the TLB logic or TAG logic may need to get another unit to do some work for them, such as to read some data from the data cache as might be needed to evict that data, or for refilling the TLB, or other purposes, and they can do this by generating instructions which did not come from the real instruction stream, and then inserting those instructions back at the multiplexor input to the execution core.

The mux control scheme may, in one mode, include a priority scheme wherein a replay instruction has higher priority than an original instruction. This is advantageous because a replay instruction is probably older than the original instruction in the original macroinstruction flow, and may be a “blocking” instruction such as if there is a true data dependency.

It is desirable to get replayed instructions finished as quickly as possible. As long as there are unresolved instructions sent to replay, new instructions that are dispatched have a fairly high probability of being dependent on something unresolved and therefore of just getting added to the list of instructions that need to be replayed. As soon as it is necessary to replay one instruction, that one instruction tends to grow a long train of instructions behind it that follows it around. The processor can quickly get in a mode where most instructions are getting executed two or three times, and such a mode may persist for quite a while. Therefore, resolving replayed instructions is very much preferable to introducing new instructions.

Each new instruction introduced while there are things to replay is a gamble. There is a certain probability the new instruction will be independent and some work will get done. On the other hand, there is a certain probability that the new instruction will be dependent and will also need to be replayed. Worse, there may be a number of instructions to follow that will be dependent on the new instruction, and all of those will have to be replayed, too, whereas if the machine had waited until the replays were resolved, then all of these instructions would not have to execute twice.

In one mode, a manufactured instruction may have higher priority than a replay instruction. This is advantageous because these manufactured instructions may be used for critically important and time-sensitive operations. One such sensitive operation is an eviction. After a cache miss, new data will be coming from the L1 cache. When that data arrives, it must be put in the data cache (L0) as quickly as possible. If that is done, the replayed load will just meet the new data and will now be successful. If the data arrives even one cycle late, the replayed load will pass again too soon and must again be replayed. Unfortunately, the data cache location where the processor is going to put the data is now holding the one and only copy of some data that was written some time ago. In other words, the location is “dirty”. It is necessary to read the dirty data out, to save it before the new data arrives and is written in its place. This reading of the old data is called “evicting” the data. In some embodiments, there is just exactly enough time to complete the eviction before starting to write the new data in its place. The eviction is done with one or more manufactured instructions. If they are held up for even one cycle, the eviction does not occur in time to avoid the problem described above, and therefore they must be given the highest priority.
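The priority ordering described for this mode (manufactured instruction over replay instruction over original instruction) can be sketched as a simple selector. This is only an illustration of the ordering; the names `Source` and `select_source` and the three "ready" flags are hypothetical and do not correspond to signals shown in the figures.

```python
from enum import IntEnum
from typing import Optional


class Source(IntEnum):
    """Multiplexor inputs; a lower value means a higher priority."""
    MANUFACTURED = 0  # injected by the TLB/TAG unit
    REPLAY = 1        # reinserted by the buffer unit
    ORIGINAL = 2      # dispatched by the scheduler


def select_source(manufactured_ready: bool,
                  replay_ready: bool,
                  original_ready: bool) -> Optional[Source]:
    """Pick which input the multiplexor forwards in this cycle."""
    if manufactured_ready:
        return Source.MANUFACTURED
    if replay_ready:
        return Source.REPLAY
    if original_ready:
        return Source.ORIGINAL
    return None  # nothing to issue this cycle
```

Under this ordering an original instruction issues only in a cycle where neither a manufactured nor a replayed instruction is waiting, which is consistent with the preference for resolving replays before introducing new work.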

The replay architecture may also be used to enable the processor to in effect “stall” without actually slowing down the execution core or performing clock throttling or the like. There are some circumstances where it would be necessary to stall the frontend and/or execution core, to avoid losing the results of instructions or to avoid other such problems. One example is where the processor's backend temporarily runs out of resources such as available registers into which to write execution results. Other examples include where the external bus is blocked, an upper level of cache is busy being snooped by another processor, a load or store crosses a page boundary, an exception occurs, or the like.

In such circumstances, rather than halt the frontend or throttle the execution core, the replay architecture may very simply be used to send back around for replay all instructions whose results would be otherwise lost. The execution core remains functioning at full speed, and there are no additional signal paths required for stalling the frontend, beyond those otherwise existing to permit the multiplexor to give priority to replay instructions over original instructions.

Other stall-like uses can be made of the replay architecture. For example, assume that a store address instruction misses in the TLB. Rather than saving the linear address to process after getting the proper entry in the TLB, the processor can just drop it on the floor and request the store address instruction to be replayed. As another example, the Page Miss Handler (not shown) may be busy. In this case the processor does not even remember that it needs to do a page walk, but finds that out over again when the store address comes back.

Most cases of running out of resources occur when there is a cache miss. There could well be no fill buffer left, so the machine can't even request an L1 lookup. Or, the L1 may be busy. When a cache miss happens, the machine MAY ask for the data from a higher level cache and MAY just forget the whole thing and not do anything at all to help the situation. In either case, the load (or store address) instruction is replayed. Unlike a more conventional architecture, the present invention does not NEED to remember this instruction in the memory subsystem and take care of it. The processor will do something to help it if it has the resources to do something. If not, it may do nothing at all, not even remember that such an instruction was seen by the memory subsystem. The memory subsystem, by itself, will never do anything for this instance of the instruction. When the instruction executes again, then it is considered all over again. In the case of a store address instruction, the instruction has delivered its linear address to the memory subsystem and it doesn't want anything back. A more conventional approach might be to say that this instruction is done, and any problems from here on out are memory subsystem problems, in which case the memory subsystem must then store information about this store address until it can get resources to take care of it. The present approach is that the store address replays, and the memory subsystem does not have to remember it at all. Here it is a little more clear that the processor is replaying the store address specifically because of inability to handle it in the memory subsystem.

In one mode, when an instruction gets replayed, all dependent instructions also get replayed. This may include all those which used the replayed instruction's output as input, all those which are down control flow branches picked according to the replayed instruction, and so forth.

The processor does not replay instructions merely because they are control flow dependent on an instruction that replayed. The thread of control was predicted. The processor is always following a predicted thread of control and never necessarily knows during execution if it is going the right way or not. If a branch gets bad input, the branch instruction itself is replayed. This is because the processor cannot reliably determine from the branch if the predicted thread of control is right or not, since the input data to the branch was not valid. No other instructions get replayed merely because the branch got bad data. Eventually, possibly after many replays, the branch will be correctly executed. At this time, it does what all branches do: it reports if the predicted direction taken for this branch was correct or not. If it was correctly predicted, everything goes on about its business. If it was not correctly predicted, then there is simply a branch misprediction; the fact that this branch was replayed any number of times makes no difference. A mispredicted branch cannot readily be repaired with a replay. A replay can only execute exactly the same instructions over again. If a branch was mispredicted, the processor has likely done many wrong instructions and needs to actually execute some completely different instructions.

To summarize: An instruction is replayed either: 1) because the instruction itself was not correctly processed for any reason, or 2) if the input data that this instruction uses is not known to be correct. Data is known to be correct if it is produced by an instruction that is itself correctly processed and all of its input data is known to be correct. In this definition, branches are viewed not as having anything to do with the control flow but as data handling instructions which simply report interesting things to the front end of the machine but do not produce any output data that can be used by any other instruction. Hence, the correctness of any other instruction cannot have anything to do with them. The correctness of the control flow is handled by a higher authority and is not in the purview of mere execution and replay.
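The two replay conditions and the recursive definition of "known to be correct" can be illustrated with a small model that walks a list of instructions in execution order. This is a sketch under simplifying assumptions (one pass, results visible immediately, no latencies); the `Instr` tuple, the `replay_set` function, and the register names are hypothetical.

```python
from typing import List, NamedTuple, Set


class Instr(NamedTuple):
    name: str
    srcs: List[str]  # registers read
    dst: str         # register written; "" for none (e.g. a branch)


def replay_set(program: List[Instr],
               goofed: Set[str],
               known_correct: Set[str]) -> List[str]:
    """Return the names of the instructions that must be replayed."""
    known = set(known_correct)
    replayed = []
    for ins in program:
        # Condition 1: the instruction itself was processed correctly.
        # Condition 2: every input is known to be correct.
        ok = ins.name not in goofed and all(s in known for s in ins.srcs)
        if ok:
            if ins.dst:
                known.add(ins.dst)   # result becomes known-correct data
        else:
            replayed.append(ins.name)
            known.discard(ins.dst)   # result is not known to be correct
    return replayed


program = [
    Instr("load", ["r0"], "r1"),
    Instr("add", ["r1", "r0"], "r2"),
    Instr("branch", ["r2"], ""),     # consumes data, produces none
    Instr("sub", ["r0", "r0"], "r3"),
]
result = replay_set(program, goofed={"load"}, known_correct={"r0"})
```

In this example the load, the dependent add, and the branch that consumed the add's result are replayed, while the `sub` that follows the branch is not, since it depends on the branch only through control flow.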

FIG. 8 illustrates more about the checker unit. Again, an instruction is replayed if: 1) it was not processed correctly, or 2) if it used input data that is not known to be correct. These two conditions give a good division for discussing the operation of the checker unit. The first condition depends on everything that needs to be done for the instruction. Anything in the machine that needs to do something to correctly execute the instruction is allowed to goof and to signal to the checker that it goofed. The first condition is therefore talking about signals that come into the checker, potentially from many places, that say, “I goofed on this instruction.”

In some embodiments, the most common goof is the failure of the data cache to supply the correct result for a load. This is signaled by the hit/miss logic. Another common goof is failure to correctly process a store address; this would typically result from a TLB miss on a store address, but there can be other causes, too. In some embodiments, the L1 cache may deliver data (which may go into the L0 cache and be used by instructions) that contains an ECC error. This would be signaled quickly, and then corrected as time permits.

In some fairly rare cases, the adder cannot correctly add two numbers. This is signaled by the flag logic which keeps tabs on the adders. In some other rare cases, the logic unit fails to get the correct answer when doing an AND, XOR, or other simple logic operation. These, too, are signaled by the flag logic. In some embodiments, the floating point unit may not get the correct answer all of the time, in which case it will signal when it goofs a floating point operation. In principle, you could use this mechanism for many types of goofs. It could be used for algorithmic goofs and it could even be used for hardware errors (circuit goofs). Regardless of the cause, whenever the processor doesn't do exactly what it is supposed to do, and the goof is detected, the processor's various units can request a replay by signaling to the checker.

The second condition which causes replays (whether data is known to be correct) is entirely the responsibility of the checker itself. The checker contains the official list of what data is known to be correct. It is what is sometimes called the “scoreboard”. It is the checker's responsibility to look at all of the input data for each instruction execution instance and to determine if all such input data is known to be correct or not. It is also the checker's responsibility to add it all up for each instruction execution instance, to determine if the result produced by that instruction execution instance can therefore be deemed to be “known to be correct”. If the result of an instruction is deemed “known to be correct”, this is noted on the scoreboard so the processor now has new, known-correct data that can be the input for other instructions.

FIG. 8 illustrates one exemplary checker which may be employed in practicing the architecture of the present invention. Because the details of the checker are not necessary in order to understand the invention, a simplified checker is illustrated to show the requirements for a checker sufficient to make the replay system work correctly.

In this embodiment, one instruction is processed per cycle. After an instruction has been executed, it is represented to the checker by signals OP1, OP1V, OP2, OP2V, DST, and a latency vector which was assigned to the uop by the decoder on the basis of the opcode. The signals OP1V and OP2V indicate whether the instruction includes a first operand and a second operand, respectively. The signals OP1 and OP2 identify the physical source registers of the first and second operands, respectively, and are received at read address ports RA1 and RA2 of the scoreboard. The signal DST identifies the physical destination register where the result of the instruction was written.

The latency vector has all 0's except a 1 in one position. The position of the 1 denotes the latency of this instruction. An instruction's latency is how many cycles there are after the instruction begins execution before another instruction can use its result. The scoreboard has one bit of storage for each physical register in the machine. The bit is 0 if that register is not known to contain correct data and it is 1 if that register is known to contain correct data.
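A one-hot latency vector of this kind can be modeled as an integer with a single bit set. The sketch below assumes, purely for illustration, that bit position N encodes a latency of N cycles; the actual mapping of positions to latencies is not specified here, and the function names are hypothetical.

```python
from typing import Optional


def encode_latency(latency: int, width: int) -> int:
    """Return a one-hot vector (as an int) with bit `latency` set."""
    if not 0 <= latency < width:
        raise ValueError("latency does not fit in the vector")
    return 1 << latency


def decode_latency(vector: int) -> Optional[int]:
    """Return the position of the single 1, or None if no bit is set."""
    if vector == 0:
        return None  # all zeros: no position is marked
    if vector & (vector - 1):
        raise ValueError("more than one bit set; not a one-hot vector")
    return vector.bit_length() - 1
```

A one-hot encoding lets the vector drive the insertion point of a delay line directly, one select line per position, without a separate decoder.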

The register renamer, described above, allocates these registers. At the time a physical register is allocated to hold the result of some instruction, the renamer sends the register number to the checker as multiple-bit signal CLEAR. The scoreboard sets to 0 the scoreboard bit which is addressed by CLEAR.

The one or two register operands for the instruction currently being checked (as indicated by OP1 and OP2) are looked up in the scoreboard to see if they are known to be correct, and the results are output as scoreboard values SV1 and SV2, respectively. An AND gate 350 receives the first scoreboard value SV1 and the first operand valid signal OP1V. Another AND gate 355 similarly receives signals SV2 and OP2V for the second operand. The operand valid signals OP1V and OP2V cause the scoreboard values SV1 and SV2 to be ignored if the instruction does not actually require those respective operands.

The outputs of the AND gates are provided to NOR gate 360, along with an external replay request signal. The output of the NOR gate will be false if either operand is required by the instruction and is not known to be correct, or if the external replay request signal is asserted. Otherwise the output will be true. The output of the NOR gate 360 is the checker output INSTRUCTION OK. If it is true, the instruction was completed correctly and is ready to be considered for retirement. If it is false, the instruction must be replayed.
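The behavior stated for gates 350, 355, and 360 can be written as a single Boolean function. For the NOR output to behave as described, each AND gate must respond to an operand that is required but not known to be correct, so this sketch assumes the scoreboard value is inverted at the AND gate input; that inversion is an inference from the stated behavior, not something spelled out in the text. The function name is hypothetical.

```python
def instruction_ok(sv1: bool, op1v: bool,
                   sv2: bool, op2v: bool,
                   external_replay_request: bool) -> bool:
    """Model of the checker output INSTRUCTION OK."""
    # AND gate 350: operand 1 is required and NOT known to be correct
    # (assumes SV1 is inverted at the gate input).
    bad1 = op1v and not sv1
    # AND gate 355: the same test for operand 2.
    bad2 = op2v and not sv2
    # NOR gate 360: true only when no input signals a problem.
    return not (bad1 or bad2 or external_replay_request)
```

Note that an operand whose valid signal is false never forces a replay, whatever its scoreboard bit holds.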

A delay line receives the destination register identifier DST and the checker output INSTRUCTION OK information for the instruction currently being checked. The simple delay line shown is constructed of registers (single cycle delays) and muxes. It will be understood that each register and mux is a multiple-bit device, or represents multiple single-bit devices. Those skilled in the art will understand that various other types of delay lines, and therefore different formats of latency vectors, could be used.

The DST and INSTRUCTION OK information is inserted in one location of the delay line, as determined by the value of the latency vector. This information is delayed for the required number of cycles according to the latency vector, and then it is applied to the write port WP of the scoreboard. The scoreboard bit corresponding to the destination register DST for the instruction is then written according to the value of INSTRUCTION OK. A value of 1 indicates that the instruction did not have to be replayed, and a value of 0 indicates that the instruction did have to be replayed, meaning that its result data is not known to be correct.
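The interaction of the scoreboard, the CLEAR signal, and the delay line can be illustrated with a small cycle-by-cycle model. This is a sketch under several assumptions that are not taken from the figures: the class and method names are hypothetical, latency is passed as a plain integer rather than a one-hot vector, each delay-line slot may hold several entries (which sidesteps collisions), and the set of registers initially known to be correct is supplied by the caller.

```python
from typing import Iterable, List, Optional, Tuple


class Checker:
    """Toy model: scoreboard plus a delay line feeding its write port."""

    def __init__(self, num_regs: int, max_latency: int,
                 known_correct: Iterable[int] = ()) -> None:
        self.scoreboard = [False] * num_regs
        for reg in known_correct:
            self.scoreboard[reg] = True
        # delay[k] holds (DST, INSTRUCTION OK) pairs due in k+1 cycles.
        self.delay: List[List[Tuple[int, bool]]] = [
            [] for _ in range(max_latency)]

    def clear(self, reg: int) -> None:
        """CLEAR from the renamer: register newly allocated."""
        self.scoreboard[reg] = False

    def cycle(self, instr: Optional[Tuple[List[int], int, int]] = None,
              external_replay_request: bool = False) -> Optional[bool]:
        """Advance one cycle; optionally check (srcs, dst, latency)."""
        # Entries reaching the end of the delay line hit the write port.
        for dst, ok in self.delay.pop(0):
            self.scoreboard[dst] = ok
        self.delay.append([])
        if instr is None:
            return None
        srcs, dst, latency = instr
        if not 1 <= latency <= len(self.delay):
            raise ValueError("latency out of range for this delay line")
        ok = (all(self.scoreboard[s] for s in srcs)
              and not external_replay_request)
        self.delay[latency - 1].append((dst, ok))
        return ok


c = Checker(num_regs=8, max_latency=4, known_correct=[1])
c.clear(2)
c.clear(3)
results = [
    c.cycle(([1], 2, 2), external_replay_request=True),  # load misses
    c.cycle(([2], 3, 1)),                                # dependent add
    c.cycle(([1], 2, 2)),                                # load replayed
    c.cycle(([2], 3, 1)),                                # add, too early
    c.cycle(([2], 3, 1)),                                # add succeeds
]
c.cycle()  # one idle cycle lets the last write reach the scoreboard
```

In this run the dependent add is replayed twice: its first replay arrives one cycle before the replayed load's result is written to the scoreboard, which illustrates how a replayed instruction can drag its dependents around with it.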

In this design, it is assumed that no instruction has physical register zero as a real destination or as a real source. If there is no valid instruction in some cycle, the latency vector for that cycle will be all zeros. This will effectively enter physical register zero with the longest possible latency into the delay line, which is harmless. Similarly, an instruction that does not have a real destination register will specify a latency vector of all zeros. It is further assumed that at startup, this unit runs for several cycles with no valid instructions arriving, so as to fill the delay line with zeros before the first real instruction has been allocated a destination register, and hence before the corresponding bit in the scoreboard has been cleared. The scoreboard needs no additional initialization.

Potentially, this checker checks one instruction per cycle (but other embodiments are of course feasible). The cycle in which an instruction is checked is a fixed number of cycles after that instruction began execution and captured the data that it used for its operands. This number of cycles later is sufficient to allow the EXTERNAL REPLAY REQUEST signal for the instruction to arrive at the checker to be processed along with the other information about the instruction. The EXTERNAL REPLAY REQUEST signal is the OR of all signals from whatever parts of the machine may produce replay requests that indicate that the instruction was not processed correctly. For example, it may indicate that data returned from the data cache may not have been correct, for any of many reasons, a good example being that there was a cache miss.

It should be appreciated by the skilled reader that the particular partitionings described above are illustrative only. For example, although it has been suggested that certain features may be relegated to the outermost core 210, it may be desirable that certain of these reside in a mid-level portion of the core, such as in the latency-intolerant core 255 of FIG. 4, between the outermost core 210 and the innermost core 260. It should also be appreciated that although the invention has been described with reference to the Intel Architecture processors, it is useful in any number of alternative architectures, and with a wide variety of microarchitectures within each.

While the invention has been described with reference to specific modes and embodiments, for ease of explanation and understanding, those skilled in the art will appreciate that the invention is not necessarily limited to the particular features shown herein, and that the invention may be practiced in a variety of ways which fall under the scope and spirit of this disclosure. The invention is, therefore, to be afforded the fullest allowable scope of the claims which follow.

We claim:
 1. A microprocessor comprising: a first execution core section operating at a first clock frequency; a second execution core section operating at a second clock frequency which is different than the first clock frequency; and an I/O ring clocked to perform input/output operations at an I/O frequency.
 2. The microprocessor of claim 1, wherein the second execution core section operates at least in part concurrently with the first execution core section.
 3. The microprocessor of claim 1, wherein: the second execution core section includes a data cache and critical arithmetic logic unit (ALU) functions; and the first execution core section includes one or more of an instruction fetch, a decode unit, and non-critical ALU functions.
 4. The microprocessor of claim 3, wherein the critical ALU functions comprise one or more of: an adder; or a logic unit to perform AND and OR operations.
 5. The microprocessor of claim 4, wherein the critical ALU functions further comprise: an address generation index register shifter.
 6. The microprocessor of claim 3, wherein the second execution core section further includes a register file.
 7. The microprocessor of claim 3, wherein the first execution core section further includes a register file.
 8. The microprocessor of claim 7, wherein: the I/O frequency is different than the first and second clock frequencies.
 9. The microprocessor of claim 8, further comprising: a first clock divider/multiplier coupled to the I/O ring and the first execution core section to divide or multiply the I/O clock frequency to generate the first clock frequency; and a second clock divider/multiplier coupled to the first and second execution core sections to divide or multiply the first clock frequency to generate the second clock frequency.
 10. The microprocessor of claim 1, wherein the microprocessor comprises a single, monolithic chip.
 11. The microprocessor of claim 1, wherein the second execution core section is disposed within the first execution core section.
 12. The microprocessor of claim 11, wherein the first execution core section is disposed within the I/O ring.
 13. The microprocessor of claim 1, wherein the first execution core section and the second execution core section are located on the same semiconductor die.
 14. The microprocessor of claim 1, wherein the second clock frequency is a multiple N of the first clock frequency.
 15. The microprocessor of claim 1, wherein the second clock frequency is faster than the first clock frequency.
 16. The microprocessor of claim 1, wherein the first execution core section is more tolerant of instruction latency than the second execution core section.
 17. The microprocessor of claim 1, further comprising: a replay architecture, the replay architecture causing an instruction to be re-executed.
 18. The microprocessor of claim 17, wherein the instruction is re-executed if the instruction was incorrectly processed because of erroneous data speculation.
 19. The microprocessor of claim 17, wherein an instruction depending on the instruction that was incorrectly processed because of erroneous data speculation is also re-executed.
 20. The microprocessor of claim 17, wherein the instruction is re-executed if: the instruction was not correctly processed for any reason; or input data used by the instruction is not known to be correct.
 21. The microprocessor of claim 17, wherein the replay architecture includes: hit/miss logic to determine whether data speculation for an instruction is correct; a checker unit to receive the output of the hit/miss logic and to direct re-execution of the instruction; and a delay unit, the delay unit to provide a copy of an instruction to the checker unit at substantially the same time as the checker unit receives the output of the hit/miss logic.
 22. The microprocessor of claim 21, wherein the delay unit is incorporated as part of the checker.
 23. The microprocessor of claim 21, wherein the checker is located within the second execution core section.
 24. A method comprising: performing an I/O operation in an I/O ring of a microprocessor at a first clock frequency to access a data item from outside the microprocessor; responsive to the I/O operation, performing a first execution operation upon the data item in a first execution sub-core of the microprocessor at a second clock frequency; and responsive to the first execution operation, performing a second execution operation in a second execution sub-core of the microprocessor at a third clock frequency, the third clock frequency being different than the second clock frequency.
 25. The method of claim 24, wherein an execution operation performed at the third clock frequency is performed at least in part concurrently with an execution operation performed at the second clock frequency.
 26. The method of claim 24, further comprising: multiplying the first clock frequency to generate the second clock frequency; and multiplying the second clock frequency to generate the third clock frequency.
 27. The method of claim 24, wherein: execution operations performed at the second clock frequency include one or more of fetch, decode, and non-critical arithmetic logic unit (ALU) functions; and execution operations performed at the third clock frequency include critical ALU functions.
 28. The method of claim 24, further comprising re-executing an instruction if the instruction was incorrectly processed because of erroneous data speculation.
 29. The method of claim 28, further comprising re-executing an instruction that depends on the instruction that was incorrectly processed.
 30. The method of claim 24, further comprising re-executing an instruction if: the instruction was not correctly processed for any reason; or input data used by the instruction is not known to be correct.
 31. A method comprising: inputting an instruction through operation of a first portion of a microprocessor at a first clock frequency; performing one or more fetch functions or decode functions associated with the instruction through operation of a second portion of the microprocessor at a second clock frequency; and performing one or more critical arithmetic logic unit (ALU) functions associated with the instruction through operation of a third portion of the microprocessor at a third clock frequency, the second clock frequency being different than the third clock frequency.
 32. The method of claim 21, wherein a function performed through operation of the second portion of the microprocessor at the second clock frequency occurs at least in part concurrently with a function performed through operation of the third portion of the microprocessor at the third clock frequency.
 33. The method of claim 31, wherein the second portion of the microprocessor comprises a first execution core.
 34. The method of claim 33, wherein the third portion of the microprocessor comprises a second execution core.
 35. The method of claim 34, wherein the first portion of the microprocessor comprises an I/O section of the microprocessor.
 36. A microprocessor comprising: a plurality of execution core sections, each execution core section operating at a different clock frequency, the plurality of execution core sections operating at least in part concurrently with each other; an I/O ring clocked to perform input/output operations at an I/O frequency.
 37. The microprocessor of claim 36, wherein: a first execution core section of the plurality of execution core sections includes one or more of instruction fetch units, instruction decode units, and non-critical ALU functions; and a second execution core section of the plurality of execution core sections includes a data cache and one or more critical arithmetic logic unit (ALU) functions.
 38. The microprocessor of claim 37, wherein the critical ALU functions comprise one or more of: an adder; or a logic unit for performing AND and OR operations.
 39. The microprocessor of claim 37, wherein the critical ALU functions further comprise: an address generation index register shifter.
 40. The microprocessor of claim 37, wherein the second execution core section further includes a register file.
 41. The microprocessor of claim 37, wherein the first execution core section further includes a register file.
 42. The microprocessor of claim 36, further comprising a plurality of clock divider/multipliers, each clock divider/multiplier to divide or multiply a first clock frequency to provide a second clock frequency to an execution core section.
 43. The microprocessor of claim 36, wherein the microprocessor comprises a single, monolithic chip.
 44. The microprocessor of claim 36, wherein a first execution core section of the plurality of execution core sections is disposed within the I/O ring.
 45. The microprocessor of claim 44, wherein each remaining execution core section of the plurality of execution core sections is disposed to be wholly within another execution core section.
 46. The microprocessor of claim 44, wherein each of the execution core sections is more tolerant of instruction latency than any execution core sections disposed within it.
 47. The microprocessor of claim 36, wherein each of the plurality of execution core sections is located on the same semiconductor die.
 48. The microprocessor of claim 47, wherein the replay architecture includes: hit/miss logic to determine whether data speculation for an instruction is correct; a checker unit to receive the output of the hit/miss logic and to direct re-execution of the instruction; and a delay unit to provide a copy of an instruction to the checker unit at substantially the same time as the checker unit receives the output of the hit/miss logic.
 49. The microprocessor of claim 36, further comprising: a replay architecture causing an instruction to be re-executed.
 50. The microprocessor of claim 49, wherein the instruction is re-executed if the instruction was incorrectly processed because of erroneous data speculation.
 51. The microprocessor of claim 50, wherein an instruction depending on the instruction that was incorrectly processed because of erroneous data speculation is also re-executed.
 52. The microprocessor of claim 51, wherein the delay unit is incorporated as part of the checker.
 53. The microprocessor of claim 46, wherein the instruction is re-executed if: the instruction was not correctly processed for any reason; or input data used by the instruction is not known to be correct.
 54. An integrated circuit comprising: logic to perform input/output (I/O) operations at a first frequency; an arithmetic logic unit (ALU) to operate at a second frequency; a floating-point unit (FPU) to operate at a third frequency, the third frequency being different than the second frequency; an instruction cache to cache fetched instructions; a renamer unit to rename specific registers indicated by instructions; a scheduler unit to reorder instructions; and a look-aside buffer to provide physical addresses of data operands; the instruction cache, renamer unit, scheduler unit, and look-aside buffer to operate at a fourth frequency.
 55. The integrated circuit of claim 54, wherein the third frequency is half of the second frequency.
 56. The integrated circuit of claim 55, further comprising an integer register file coupled to the ALU to operate at the second frequency and a floating point register file coupled to the FPU to operate at the third frequency.
 57. The integrated circuit of claim 54, wherein the fourth frequency is the same as the second frequency.
 58. The integrated circuit of claim 54, wherein the fourth frequency is slower than the third frequency.
 59. The integrated circuit of claim 54, wherein the I/O operations are selected from a group consisting of buffering data, buffering instructions, receiving data, receiving instructions, parity checking, and communicating with external devices.
 60. The integrated circuit of claim 54, wherein the third frequency is substantially 0 MHz when the FPU is powered down.
 61. The integrated circuit of claim 54, wherein the second frequency is substantially 0 MHz when the ALU is powered down.
 62. An integrated circuit comprising: logic to perform input/output (I/O) operations at a first clock frequency; a first arithmetic logic unit (ALU), a first data cache, and a first register file to operate at a second clock frequency; and a second ALU, a second register file, and a second data cache to operate at a third clock frequency, the third clock frequency being different than the second clock frequency.
 63. The integrated circuit of claim 62, further comprising a floating-point unit (FPU) to operate at the third clock frequency.
 64. The integrated circuit of claim 63, wherein the second ALU, second data cache, second register file, and the FPU are not nested within the first ALU, first data cache, and first register file.
 65. The integrated circuit of claim 63, wherein the third clock frequency is faster than the second clock frequency.
 66. The integrated circuit of claim 62, wherein the second clock frequency is a multiple of N of the third clock frequency.
 67. The integrated circuit of claim 62, wherein the second clock frequency is substantially 0 when the first ALU, first data cache, and first register file are powered down, the third clock frequency being an integer multiple of the first clock frequency.
 68. The integrated circuit of claim 62, further comprising: a look-aside buffer operating at a fourth circuit frequency, the look-aside buffer having a first partition dedicated to the first ALU, first data cache, and first register file and a second partition dedicated to the second ALU, second data cache, and second register file.
 69. The integrated circuit of claim 62, further comprising: a first look-aside buffer, a first renamer unit, a first scheduler unit, and a first hit/miss unit operating at the second clock frequency; and a second look-aside buffer, a second renamer unit, a second scheduler unit, and a second hit/miss unit operating at the third clock frequency.
 70. A microprocessor comprising: a fetch unit and a decoder to operate at a first frequency; a multiplier and a first shifter to operate at a second frequency; and an adder and logic to perform AND and OR operations to operate at a third frequency, the third frequency being different from the second frequency.
 71. The microprocessor of claim 70, wherein the first frequency is lower than the second frequency, and wherein the third frequency is an integer multiple of the second frequency.
 72. The microprocessor of claim 71, wherein the third frequency is higher than the second frequency by a factor of 2.
 73. The microprocessor of claim 72, wherein the second and third frequencies are not integer multiples of the first frequency.
 74. The microprocessor of claim 72, further comprising: a register file, the register file coupled to the adder and to the logic; and a second shifter; the register file and the second shifter to operate at the third frequency.
 75. The microprocessor of claim 72, further comprising an instruction cache and a register file to operate at the first frequency.